# Make sure that stringsAsFactors = FALSE
options(stringsAsFactors = FALSE)
if (!require(install.load)) {
  install.packages("install.load")
  library(install.load)
}
install_load("tidyverse", "moments", "plotly", "data.table", "fitdistrplus", "knitr")
library(tidyverse)
library(moments)
library(plotly)
library(data.table)
library(fitdistrplus)
library(knitr)
In this task, we create the distribution of the logistic delay of
component K7. Two data sets were provided:
Komponente_K7.csv, which contains the production date, and
Logistikverzug_K7.csv, which contains the incoming date of the
product. To start analyzing the data, we import the data sets and
set their column types accordingly.
# For semicolon delimited
K7 <- read_delim("Data/Logistikverzug/Komponente_K7.csv",
                 delim = ";", show_col_types = FALSE)[2:6]
K7$Fehlerhaft <- as.logical(K7$Fehlerhaft)
K7$Produktionsdatum <- as.Date(K7$Produktionsdatum, format = "%Y-%m-%d")
# For comma delimited
LK7 <- read_delim("Data/Logistikverzug/Logistikverzug_K7.csv",
                  delim = ",", show_col_types = FALSE)[2:6]
LK7$Fehlerhaft <- as.logical(LK7$Fehlerhaft)
LK7$Wareneingang <- as.Date(LK7$Wareneingang, format = "%Y-%m-%d")
After importing the data, we check whether the column Fehlerhaft is equal in both data sets.
Result <- K7$Fehlerhaft == LK7$Fehlerhaft
Result[Result == FALSE] # show any rows where the columns differ
## logical(0)
Based on this result, the column Fehlerhaft is equal in both data
sets. The two data sets K7 and LK7 can therefore be combined using
full_join(). A column named Wochentag is created for the weekday of
the production date. Afterwards, all NA values in the columns
Produktionsdatum and Wareneingang are filtered out. A column
Warenausgang is then created: if the production day is a Friday or a
Saturday, the product can only be shipped on Monday, so the outgoing
date equals the production date plus 3 or 2 days, respectively. For
all other days, the outgoing date is the production date plus 1 day.
logistics_delay <- K7 %>%
  full_join(LK7, by = c("IDNummer", "Herstellernummer", "Werksnummer", "Fehlerhaft")) %>%
  mutate(Wochentag = weekdays(Produktionsdatum)) %>% # weekday names (German locale assumed)
  filter(!is.na(Produktionsdatum)) %>% # filter NA values
  filter(!is.na(Wareneingang)) %>%     # filter NA values
  mutate(Warenausgang = ifelse(Wochentag == "Freitag",
                               Produktionsdatum + 3,
                               ifelse(Wochentag == "Samstag",
                                      Produktionsdatum + 2,
                                      Produktionsdatum + 1)))
The column Warenausgang can now be converted to date format. The logistic delay can then be calculated by subtracting the outgoing date (Warenausgang) from the incoming date (Wareneingang).
logistics_delay <- logistics_delay %>%
  mutate(Warenausgang = as.Date(Warenausgang,
                                origin = "1970-01-01",
                                format = "%Y-%m-%d")) %>%
  mutate(Verzug = Wareneingang - Warenausgang)
To determine the distribution of the logistic delay, we use the
function descdist from the package
fitdistrplus. To use it, the column Verzug needs to be
converted to numeric. Afterwards, the column is analyzed as a discrete
distribution, since the data contains integer values.
# Convert the Verzug column to numeric
logistics_delay$Verzug <- as.numeric(logistics_delay$Verzug)
# Determine the logistics distribution
descdist(logistics_delay$Verzug, discrete = TRUE)
## summary statistics
## ------
## min: 3 max: 14
## median: 6
## mean: 6.080441
## estimated sd: 1.012306
## estimated skewness: 0.567401
## estimated kurtosis: 3.630036
According to the Cullen and Frey graph, the logistics data follows a negative binomial distribution. The skewness value measures the symmetry, or lack of symmetry, of the distribution. Since the skewness value is positive and the mean is slightly larger than the median, the distribution is positively skewed, i.e. the tail is on the right side of the distribution (Source: https://www.vrcbuzz.com/karl-pearsons-coefficient-of-skewness-using-r-with-examples/). Since the kurtosis value is greater than 3, the distribution is said to be leptokurtic. This means that outliers are more likely and that the distribution is peaked and has long (thick) tails.
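These interpretations can be checked with the skewness() and kurtosis() functions from the moments package (loaded above). The sketch below uses a synthetic right-skewed sample x rather than the actual delay data:

```r
library(moments)

# Synthetic, right-skewed integer sample (NOT the actual Verzug data);
# a negative binomial sample is used purely for illustration.
set.seed(42)
x <- rnbinom(1000, size = 30, prob = 0.85)

skewness(x) # positive for a right-skewed sample
kurtosis(x) # values above 3 indicate a leptokurtic shape
```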
To calculate the mean, the following code is used.
mean <- as.numeric(mean(logistics_delay$Verzug))
# Print the result
cat("Mean = ", mean)
## Mean = 6.080441
To determine the bin width for our histogram, we divide the range of the delays (here taken from 1 to 14) by the total number of bins, here 14. So the bin width is:
binwidth <-(14-1)/14
cat("Binwidth = ", binwidth)
## Binwidth = 0.9285714
Since the calculated bin width is roughly 0.9, it makes much more sense to round it to 1, because the delay days are integers. The data can be plotted as follows:
logistics_delay_plot <- ggplot(logistics_delay, aes(x = Verzug)) +
  geom_histogram(aes(y = after_stat(density)),
                 colour = "black", fill = "white", binwidth = 1) +
  scale_x_continuous(breaks = c(0:14)) +
  geom_density(fill = "#FF6666", adjust = 10, alpha = 0.5)
ggplotly(logistics_delay_plot)
To create a decision tree, it is important to know which and how many attributes influence the defect of component K7. Based on the data set K7, there are four columns that could be relevant: IDNummer, Produktionsdatum, Herstellernummer and Werksnummer. Note that IDNummer, Herstellernummer and Werksnummer contain similar information. Herstellernummer and Werksnummer are directly connected, hence they should not be considered as separate attributes. The IDNummer also includes Herstellernummer and Werksnummer, followed by the serial number of production, hence IDNummer should not be treated as a separate attribute either. Based on this, the defect of component K7 is determined solely by Produktionsdatum and Herstellernummer. The data set can be summarized as follows:
K7_decision_tree <- K7 %>%
dplyr::select(Produktionsdatum, Herstellernummer, Fehlerhaft) %>%
arrange(desc(Fehlerhaft))
head(K7_decision_tree)
## # A tibble: 6 × 3
## Produktionsdatum Herstellernummer Fehlerhaft
## <date> <dbl> <lgl>
## 1 2011-02-02 114 TRUE
## 2 2011-06-02 114 TRUE
## 3 2013-10-29 112 TRUE
## 4 2014-01-14 112 TRUE
## 5 2014-06-02 112 TRUE
## 6 2014-11-20 112 TRUE
tail(K7_decision_tree)
## # A tibble: 6 × 3
## Produktionsdatum Herstellernummer Fehlerhaft
## <date> <dbl> <lgl>
## 1 2016-11-12 112 FALSE
## 2 2016-11-12 112 FALSE
## 3 2016-11-12 112 FALSE
## 4 2016-11-12 112 FALSE
## 5 2016-11-12 112 FALSE
## 6 2016-11-13 112 FALSE
The next step is to determine which of the attributes can be considered the root attribute. To do this, we measure the information gain \(IG(D,A)\) obtained by testing each attribute A (Produktionsdatum and Herstellernummer) against the target attribute D, which is in this case Fehlerhaft. It is calculated as \(IG(D,A) = H(D) - Rest(A)\), where \(H(D)\) is the entropy of the data set with respect to the target attribute D and \(Rest(A)\) is the entropy that remains after testing attribute A.
To calculate the entropy of the data set D, the following equation is used,
where the variable \(q\) describes the proportion of defective components
to the total number of components:

\[H(D) = -\left(q \log_2 q + (1-q) \log_2 (1-q)\right)\]
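As a short sketch, this binary entropy can be computed in R; the helper name entropy below is our own choice for illustration, not taken from the source:

```r
# Binary (Shannon) entropy of a proportion q:
# H(q) = -(q * log2(q) + (1 - q) * log2(1 - q)).
# 'entropy' is a hypothetical helper name; the edge cases q = 0 and
# q = 1 are defined as 0 by convention.
entropy <- function(q) {
  if (q == 0 || q == 1) return(0)
  -(q * log2(q) + (1 - q) * log2(1 - q))
}

entropy(0.5) # maximal uncertainty: 1 bit
```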
Afterwards, we can calculate the remaining entropy of each attribute (Produktionsdatum and Herstellernummer) using the following equation:

\[Rest(A) = \sum_{k} \frac{p_k + n_k}{p + n} \; H(q_k)\]
Here, \(p_{k}\) and \(n_{k}\) correspond to the number of
defective and non-defective components for each value \(k\) of the
corresponding attribute, and \(p\) and \(n\) are their totals.
\(q_{k}\) is calculated by dividing \(p_{k}\) by the sum of \(p_{k}\)
and \(n_{k}\). Using this equation, we can calculate the remaining
entropy for both Produktionsdatum - \(Rest(Produktionsdatum)\) - and
Herstellernummer - \(Rest(Herstellernummer)\) - and hence determine the
information gain for both, \(IG(D, Produktionsdatum)\) and
\(IG(D, Herstellernummer)\). The attribute with the larger information
gain is then selected as the root attribute and the other as the
sub-attribute. The decision tree may look as follows.
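The information-gain calculation described above can be sketched on a small invented example; the data frame toy and all values in it are made up for illustration and are not the K7 data:

```r
library(dplyr)

# Binary entropy; q = 0 or q = 1 contributes 0 by convention.
entropy <- function(q) {
  ifelse(q == 0 | q == 1, 0, -(q * log2(q) + (1 - q) * log2(1 - q)))
}

# Toy data: 'A' plays the role of an attribute such as Herstellernummer,
# 'D' the role of the target attribute Fehlerhaft.
toy <- data.frame(
  A = c("a", "a", "a", "b", "b", "b"),
  D = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
)

# Entropy of the target attribute: H(D)
H_D <- entropy(mean(toy$D))

# Remaining entropy after splitting on A: Rest(A)
rest_A <- toy %>%
  group_by(A) %>%
  summarise(n = n(), q = mean(D), .groups = "drop") %>%
  summarise(rest = sum(n / sum(n) * entropy(q))) %>%
  pull(rest)

# Information gain: IG(D, A) = H(D) - Rest(A)
IG <- H_D - rest_A
```

Because the group "b" is perfectly pure (no defective parts), splitting on A removes half of the initial entropy in this toy example.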
It is reasonable to save the data in separate files instead of one huge table. The provided structure represents a relational database: it contains reference keys and linking tables, and it is built on a per-table basis.
To get the cars that were registered in ADELSHOFEN, we read the file “Zulassungen” and filter by Gemeinden == “ADELSHOFEN”.
# Reading Zulassung File
zulassung <- read_delim("Data/Zulassungen/Zulassungen_alle_Fahrzeuge.csv",
                        delim = ";", show_col_types = FALSE)
zulassung_adel <- zulassung %>% filter(Gemeinden == "ADELSHOFEN")
To determine the number of cars that were built with T16 parts and registered in ADELSHOFEN, it is necessary to know the IDs of the cars that contain a T16 part. To start with, we read the files “Bestandteile Komponente” and determine the components that contain T16 parts.
# Reading T16 Teil Komponente
T16_Kom_1 <- read_delim("Data/Komponente/Bestandteile_Komponente_K2LE2.csv",
delim =";",
show_col_types = FALSE) %>%
dplyr::select("ID_T16","ID_K2LE2")
names(T16_Kom_1)[2] <- "ID_Sitze"
T16_Kom_2 <- read_delim("Data/Komponente/Bestandteile_Komponente_K2ST2.csv",
delim =";",
show_col_types = FALSE) %>%
dplyr::select("ID_T16","ID_K2ST2")
names(T16_Kom_2)[2] <- "ID_Sitze"
Once we have the components, we need to find out the car IDs. For that, we read the files “Bestandteile Fahrzeuge” and combine the data (Bestandteile Komponente with Komponente Fahrzeug).
Kom_K2LE2 <- read.delim("Data/Komponente/Komponente_K2LE2.txt", sep = "\\") %>%
  dplyr::select(c("ID_Sitze", "Werksnummer")) # not strictly needed: adds no extra info
Kom_K2ST2 <- read_delim("Data/Komponente/Komponente_K2ST2.csv",
                        delim = ";", show_col_types = FALSE) %>%
  dplyr::select(c("ID_Sitze", "Werksnummer"))
T16_K2LE2 <- T16_Kom_1 %>% left_join(Kom_K2LE2, by = "ID_Sitze")
T16_K2ST2 <- T16_Kom_2 %>% left_join(Kom_K2ST2, by = "ID_Sitze")
T16_Kom_bind <- rbind(T16_K2LE2, T16_K2ST2)
# Read Bestandteile Fahrzeuge and combine them
Kom_Fahr_1 <- read_delim("Data/Fahrzeug/Bestandteile_Fahrzeuge_OEM2_Typ21.csv",
                         delim = ";", show_col_types = FALSE) %>%
  dplyr::select(c("ID_Sitze", "ID_Fahrzeug"))
Kom_Fahr_2 <- read_delim("Data/Fahrzeug/Bestandteile_Fahrzeuge_OEM2_Typ22.csv",
                         delim = ";", show_col_types = FALSE) %>%
  dplyr::select(c("ID_Sitze", "ID_Fahrzeug"))
Kom_Fahr_bind <- rbind(Kom_Fahr_1, Kom_Fahr_2)
Finally, we merge our “Bestandteile Fahrzeuge” data with the “Zulassungen” data filtered by ADELSHOFEN.
# Join the data sets to keep only the vehicles with T16 parts
# that were registered in ADELSHOFEN
Kom_Fahr <- T16_Kom_bind %>% left_join(Kom_Fahr_bind, by = "ID_Sitze")
Kom_Fahr_Adel <- Kom_Fahr %>% inner_join(zulassung_adel, by = c("ID_Fahrzeug" = "IDNummer"))
The number of cars registered in ADELSHOFEN:
cat("Number of cars registered in Adelshofen:", nrow(Kom_Fahr_Adel))
## Number of cars registered in Adelshofen: 48
The data types of the table “Zulassungen” are shown with the str() function:
str(zulassung)
## spec_tbl_df [3,204,104 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ...1 : num [1:3204104] 408097 408098 1 2 3 ...
## $ IDNummer : chr [1:3204104] "11-1-11-1" "11-1-11-2" "12-1-12-1" "12-1-12-2" ...
## $ Gemeinden: chr [1:3204104] "DRESDEN" "DRESDEN" "LEIPZIG" "LEIPZIG" ...
## $ Zulassung: Date[1:3204104], format: "2009-01-01" "2009-01-01" ...
## - attr(*, "spec")=
## .. cols(
## .. ...1 = col_double(),
## .. IDNummer = col_character(),
## .. Gemeinden = col_character(),
## .. Zulassung = col_date(format = "")
## .. )
## - attr(*, "problems")=<externalptr>
The data types are explained in the following table:
| Attribute | Data.Type | Description |
|---|---|---|
| …1 | Numeric | This data type contains numbers that represent the unique index of each registration |
| IDNummer | Character | This data type contains strings with vehicle type, its OEM number, its factory, and its unique index |
| Gemeinden | Character | This data type contains strings of registration location |
| Zulassung | Date | This data type contains date of registration |
Q: Why does it make sense to store the records in a database on a server?
A: Storing the records in a database on a server has the advantage that no local memory is occupied and, more importantly, permanent access to the most recent data records can be guaranteed in most cases, with a few exceptions, for example when maintenance takes place.
Q: Why can’t you store the records on your personal computer?
A: Since databases tend to store large amounts of data, this could slow the computer down. Also, the quality of the data deteriorates quickly, since the data does not get updated.
Q: What is an easy way to make your application available to your customers?
A: “RStudio Shiny Server Open Source” allows customers to access the app from anywhere with any web browser. To share the app, the code can, for example, be uploaded to git as a public repository.
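As a minimal sketch of such an app (the inputs, the plot and the data below are generic placeholders, not the actual project application), a Shiny app that could be served by Shiny Server looks like this:

```r
library(shiny)

# Minimal placeholder app: a slider controlling a histogram of random
# numbers. On a Shiny Server, this code would typically live in app.R.
ui <- fluidPage(
  sliderInput("n", "Number of observations:",
              min = 10, max = 500, value = 100),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = "Random sample")
  })
}

shinyApp(ui = ui, server = server) # run locally or deploy on a server
```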
In order to find out where the car with the chassis “K5-112-1122-79” was registered, it is important to know the car ID. This is possible by reading the file “Bestandteile Fahrzeuge”.
car <- read_delim("Data/Fahrzeug/Bestandteile_Fahrzeuge_OEM1_Typ12.csv",
                  delim = ";", show_col_types = FALSE) %>%
  dplyr::select(c("ID_Karosserie", "ID_Fahrzeug")) %>%
  filter(ID_Karosserie == "K5-112-1122-79")
car_registered <- filter(zulassung, IDNummer == car$ID_Fahrzeug)
gemeinden <- as.character(car_registered$Gemeinden)
cat("Car with ID_Karosserie 'K5-112-1122-79' got registered in", gemeinden)
## Car with ID_Karosserie 'K5-112-1122-79' got registered in ASCHERSLEBEN